Support Vector Machine For Functional Data Classification



Abstract 

In many applications, input data are sampled functions taking their values in
infinite dimensional spaces rather than standard vectors. This has complex
consequences for data analysis algorithms and motivates their modification. In
fact, most of the traditional data analysis tools for regression, classification
and clustering have been adapted to functional inputs under the general name of
Functional Data Analysis (FDA). In this paper, we investigate the use of Support
Vector Machines (SVMs) for functional data analysis and we focus on the problem
of curve discrimination. SVMs are large margin classifiers based on implicit non
linear mappings of the considered data into high dimensional spaces thanks to
kernels. We show how to define simple kernels that take into account the
functional nature of the data and lead to consistent classification. Experiments
conducted on real world data emphasize the benefit of taking into account some
functional aspects of the problems.

Key words: Functional Data Analysis, Support Vector Machine, Classification, 
Consistency 
1 Introduction



In many real world applications, data should be considered as discretized
functions rather than as standard vectors. In these applications, each
observation corresponds to a mapping between some conditions (that might be
implicit) and the observed response. A well studied example of such functional
data is given by spectrometric data (see section 6.3): each spectrum is a
function that maps the wavelengths of the illuminating light to the
corresponding absorbances (the responses) of the studied sample. Other natural
examples can be found in the area of voice recognition (see sections 6.1 and
6.2) or in meteorological problems, and more generally, in multiple time series
analysis where each observation is a complete time series.



The direct use of classical models for this type of data faces several
difficulties: as the inputs are discretized functions, they are generally
represented by high dimensional vectors whose coordinates are highly correlated.
As a consequence, classical methods lead to ill-posed problems, both from a
theoretical point of view (when working in functional spaces that have infinite
dimension) and from a practical one (when working with the discretized functions).
The goal of Functional Data Analysis (FDA) is to use, in data analysis al- 
gorithms, the underlying functional nature of the data: many data analysis 
methods have been adapted to functions (see [i^ for a comprehensive intro- 
duction to functional data analysis and a review of linear methods). While the 
original papers on FDA focused on linear methods such as Principal Compo- 
nent Analysis 0, i, i, [J and the linear model [30, lla, llSj , non linear models 
have been studied extensively in recent years. This is the case, for instance,
of most neural network models [14, 31, 32, 33].



In the present paper, we adapt Support Vector Machines (SVMs, see e.g. [421 . 
0]) to functional data classification (the paper extends results from [34, 44]).
We show in particular both the practical and theoretical advantages of using
functional kernels, which are kernels that take into account the functional
nature of the data. From a practical point of view, those kernels make it
possible to take advantage of expert knowledge on the data. From a theoretical
point of view, a specific type of functional kernel allows the construction of a
consistent training procedure for functional SVMs.



The paper is organized as follows: section 2 presents functional data
classification and explains why it generally leads to ill-posed problems.
Section 3 provides a short introduction to SVMs and explains why their
generalization to FDA can lead to particular problems. Section 4 describes
several functional kernels and explains how they can be computed in practice,
while section 5 presents a consistency result for some of them. Finally,
section 6 illustrates the various approaches presented in the paper on real
data sets.
2 Functional Data Analysis



2.1 Functional Data



To simplify the presentation, this article focuses on functional data for which
each observation is described by one function from $\mathbb{R}$ to $\mathbb{R}$.
Extension to the case of several real valued functions is straightforward. More
formally, if $\mu$ denotes a known finite positive Borel measure on $\mathbb{R}$,
an observation is an element of $L^2(\mu)$, the Hilbert space of
$\mu$-square-integrable real valued functions defined on $\mathbb{R}$. In some
situations, additional regularity assumptions (e.g., existence of derivatives)
will be needed.

However, almost all the developments of this paper are not specific to
functions and use only the Hilbert space structure of $L^2(\mu)$. We will
therefore denote by $\mathcal{X}$ an arbitrary Hilbert space and by
$\langle \cdot, \cdot \rangle$ the corresponding inner product. Additional
assumptions on $\mathcal{X}$ will be given on a case by case basis. As stated
above, the most common situation will of course be $\mathcal{X} = L^2(\mu)$ with
$\langle u, v \rangle = \int uv \, d\mu$.



2.2 Data analysis methods for Hilbert spaces 



It should first be noted that many data analysis algorithms can be written so
as to apply, at least from a theoretical point of view, to arbitrary Hilbert
spaces. This is obviously the case, for instance, for distance-based algorithms
such as the $k$-nearest neighbor method. Indeed, this algorithm uses only the
fact that distances between observations can be calculated. Obviously, it can
be applied to Hilbert spaces using the distance induced by the inner product.
This is also the case of methods directly based on inner products such as
multi-layer perceptrons (see [35, 36, 41] for a presentation of multi-layer
perceptrons with almost arbitrary input spaces, including Hilbert spaces).

However, functional spaces have infinite dimension and a basic transposition
of standard algorithms introduces both theoretical and practical difficulties.
In fact, some simple problems in $\mathbb{R}^d$ become ill-posed in
$\mathcal{X}$ when the space has infinite dimension, even from a theoretical
point of view.

Let us consider for instance the linear regression model in which a real valued
target variable $Y$ is modeled by $E(Y|X) = H(X)$ where $H$ is a linear
continuous operator defined on the input space. When $X$ has values in
$\mathbb{R}^d$ (i.e., $\mathcal{X} = \mathbb{R}^d$), $H$ can be easily estimated
by the least squares method, which leads to the inversion of the covariance
matrix of $X$. In practice, problems might appear when $d$ is not small compared
to $N$, the number of available examples, and regularization techniques should
be used (e.g., ridge regression [21]). When $X$ has values in a Hilbert space,
the problem is ill-posed because the covariance of $X$ is a Hilbert-Schmidt
operator and thus has no continuous inverse; direct approximation of the inverse
of this operator is then problematic as it does not provide a consistent
estimate (see [^).
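The practical side of this difficulty can be illustrated numerically. The sketch below is only an illustration under assumptions of ours (the family of curves, the grid size `d = 50`, the sample size and the helper `correlation` are not taken from the paper): it discretizes smooth random curves and measures the empirical correlation between neighbouring coordinates. Since every simulated curve is a combination of two fixed functions, the $d \times d$ empirical covariance matrix has rank at most 2 and cannot be inverted directly.

```python
import math
import random

rng = random.Random(0)
d = 50                                   # number of discretization points (assumed)
grid = [k / d for k in range(d)]

# 200 smooth random curves x(t) = a sin(2 pi t) + b cos(2 pi t), sampled on the grid
curves = []
for _ in range(200):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    curves.append([a * math.sin(2 * math.pi * t) + b * math.cos(2 * math.pi * t)
                   for t in grid])

def correlation(xs, ys):
    """Empirical (Pearson) correlation between two coordinate samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# correlation between the values observed at t_k and t_{k+1}, for every k
neighbour_corr = [
    correlation([c[k] for c in curves], [c[k + 1] for c in curves])
    for k in range(d - 1)
]
print(min(neighbour_corr))
```

In this toy setting all neighbouring coordinates are strongly correlated, which is the situation described above for discretized functions.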

To overcome the infinite dimensional problem, most FDA methods so far have been
constructed thanks to two general principles: filtering and regularization. In
the filtering approach, the idea is to use representation methods that make it
possible to work in finite dimension (see for instance for the functional
linear model and [3] for a functional $k$-nearest neighbor method). In the
regularization approach, the complexity of the solution is constrained thanks to
smoothness constraints. For instance, building a linear model in a Hilbert space
consists in finding a function $h \in L^2(\mu)$ such that
$E(Y|X) = \langle h, X \rangle$. In the regularization approach, $h$ is chosen
among smooth candidates (for instance twice differentiable functions with
minimal curvature), see e.g. [3, 0,0]. Other examples of the regularization
approach include smooth Principal Component Analysis ji^] and penalized
Canonical Component Analysis [23]. A comparison of filtering and regularization
approaches for a semi-parametric model used in curve discrimination can be
found in [13].



Using both approaches, a lot of data analysis algorithms have been successfully 
adapted to functional data. Our goal in the present paper is to study the case 
of Support Vector Machines (SVM), mainly thanks to a filtering approach. 



3 Support Vector Machines for FDA 



3.1 Support Vector Machines



We give, in this section, a very brief presentation of Support Vector Machines
(SVMs) that is needed for the definition of their functional versions. We refer
the reader to e.g. [3] for a more comprehensive presentation. As stated in
section 2.1, $\mathcal{X}$ denotes an arbitrary Hilbert space. Our presentation
of SVM departs from the standard introduction because it assumes that the
observations belong to $\mathcal{X}$ rather than to $\mathbb{R}^d$. This will
make clear that the definition of SVM on arbitrary Hilbert spaces is not the
difficult part in the construction of functional SVM. We will discuss problems
related to the functional nature of the data in section 3.2.

Our goal is to classify data into two predefined classes. We assume given a
learning set, i.e. $N$ examples $(x_1, y_1), \ldots, (x_N, y_N)$ which are
i.i.d. realizations of the random variable pair $(X, Y)$ where $X$ has values in
$\mathcal{X}$ and $Y$ in $\{-1, 1\}$, i.e. $Y$ is the class label for $X$ which
is the observation.



3.1.1 Hard margin SVM 

The principle of SVM is to perform an affine discrimination of the observations
with maximal margin, that is to find an element $w \in \mathcal{X}$ with a
minimum norm and a real value $b$, such that
$y_i(\langle w, x_i \rangle + b) \geq 1$ for all $i$. To do so, we have to solve
the following quadratic programming problem:

$$(P_0) \quad \min_{w, b} \langle w, w \rangle, \quad \text{subject to } y_i(\langle w, x_i \rangle + b) \geq 1, \; 1 \leq i \leq N.$$

The classification rule associated to $(w, b)$ is simply
$f(x) = \mathrm{sign}(\langle w, x \rangle + b)$. In this situation (called hard
margin SVM), we request the rule to have zero error on the learning set.



3.1.2 Soft margin SVM 

In practice, the solution provided by problem $(P_0)$ is not very satisfactory.
Firstly, perfectly linearly separable problems are quite rare, partly because
non linear problems are frequent, but also because noise can turn a linearly
separable problem into a non separable one. Secondly, choosing a classifier
with maximal margin does not prevent overfitting, especially in very high
dimensional spaces (see e.g. [19] for a discussion about this point).



A first step to solve this problem is to allow some classification errors on the
learning set. This is done by replacing $(P_0)$ by its soft margin version,
i.e., by the problem:

$$\begin{aligned}
(P_C) \quad & \min_{w, b, \xi} \langle w, w \rangle + C \sum_{i=1}^{N} \xi_i, \\
& \text{subject to } y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad 1 \leq i \leq N, \\
& \phantom{\text{subject to }} \xi_i \geq 0, \quad 1 \leq i \leq N.
\end{aligned}$$

Classification errors are allowed thanks to the slack variables $\xi_i$. The $C$
parameter acts as an inverse regularization parameter. When $C$ is small, the
cost of violating the hard margin constraints, i.e., the cost of having some
$\xi_i > 0$, is small and therefore the constraint on $w$ dominates. On the
contrary, when $C$ is large, classification errors dominate and $(P_C)$ gets
closer to $(P_0)$.



3.1.3 Nonlinear SVM 

As noted in the previous section, some classification problems don't have a
satisfactory linear solution but have a non linear one. Non linear SVMs are
obtained by transforming the original data. Assume given a Hilbert space
$\mathcal{H}$ (and denote $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ the
corresponding inner product) and a function $\phi$ from $\mathcal{X}$ to
$\mathcal{H}$ (this function is called a feature map). A linear SVM in
$\mathcal{H}$ can be constructed on the data set
$(\phi(x_1), y_1), \ldots, (\phi(x_N), y_N)$. If $\phi$ is a non linear mapping,
the classification rule
$f(x) = \mathrm{sign}(\langle w, \phi(x) \rangle_{\mathcal{H}} + b)$ is also non
linear.

In order to obtain the linear SVM in $\mathcal{H}$ one has to solve the
following optimization problem:

$$\begin{aligned}
(P_{C,\mathcal{H}}) \quad & \min_{w, b, \xi} \langle w, w \rangle_{\mathcal{H}} + C \sum_{i=1}^{N} \xi_i, \\
& \text{subject to } y_i(\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i, \quad 1 \leq i \leq N, \\
& \phantom{\text{subject to }} \xi_i \geq 0, \quad 1 \leq i \leq N.
\end{aligned}$$

It should be noted that this feature mapping makes it possible to define SVM on
almost arbitrary input spaces.



3.1.4 Dual formulation and Kernels 

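Solving problems $(P_C)$ or $(P_{C,\mathcal{H}})$ might seem very difficult at
first, because $\mathcal{X}$ and $\mathcal{H}$ are arbitrary Hilbert spaces and
can therefore have very high or even infinite dimension (when $\mathcal{X}$ is a
functional space for instance). However, each problem has a dual formulation.
More precisely, $(P_C)$ is equivalent to the following optimization problem
(see [3]):

$$\begin{aligned}
(D_C) \quad & \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \\
& \text{subject to } \sum_{i=1}^{N} \alpha_i y_i = 0, \\
& \phantom{\text{subject to }} 0 \leq \alpha_i \leq C, \quad 1 \leq i \leq N.
\end{aligned}$$

This result applies to the original problem in which data are not mapped into
$\mathcal{H}$, but also to the mapped data, i.e., $(P_{C,\mathcal{H}})$ is
equivalent to a problem $(D_{C,\mathcal{H}})$ in which the $x_i$ are replaced by
$\phi(x_i)$ and in which the inner product of $\mathcal{H}$ is used. This leads
to:

$$\begin{aligned}
(D_{C,\mathcal{H}}) \quad & \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}, \\
& \text{subject to } \sum_{i=1}^{N} \alpha_i y_i = 0, \\
& \phantom{\text{subject to }} 0 \leq \alpha_i \leq C, \quad 1 \leq i \leq N.
\end{aligned}$$

Solving $(D_{C,\mathcal{H}})$ rather than $(P_{C,\mathcal{H}})$ has two
advantages. The first positive aspect is that $(D_{C,\mathcal{H}})$ is an
optimization problem in $\mathbb{R}^N$ rather than in $\mathcal{H}$, which can
have infinite dimension (the same is true for $\mathcal{X}$).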
The second important point is linked to the fact that the optimal
classification rule can be written
$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle_{\mathcal{H}} + b\right)$.
This means that both the optimization problem and the classification rule do
not make direct use of the transformed data, i.e. of the $\phi(x_i)$. All the
calculations are done through the inner product in $\mathcal{H}$, more precisely
through the values $\langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$.
Therefore, rather than choosing directly $\mathcal{H}$ and $\phi$, one can
provide a so called kernel function $K$ such that
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$ for a given
pair $(\mathcal{H}, \phi)$.

In order that $K$ corresponds to an actual inner product in a Hilbert space, it
has to fulfill some conditions. $K$ has to be symmetric and positive definite,
that is, for every $N$, $x_1, \ldots, x_N$ in $\mathcal{X}$ and
$\alpha_1, \ldots, \alpha_N$ in $\mathbb{R}$,
$\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) \geq 0$. If $K$
satisfies those conditions, according to the Moore-Aronszajn theorem there
exists a Hilbert space $\mathcal{H}$ and a feature map $\phi$ such that
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$.
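The positivity condition can be explored numerically on a small example. The following sketch is an illustration only: the helper names (`gaussian_kernel`, `gram_matrix`, `quadratic_form`), the sample points and the value `sigma = 0.5` are assumptions of ours. It builds the Gram matrix of a Gaussian kernel and evaluates the quadratic form $\sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j)$ for random coefficients. Such a check cannot prove that a kernel is positive definite; it can only fail to find a counterexample.

```python
import math
import random

def gaussian_kernel(u, v, sigma=0.5):
    """K(u, v) = exp(-sigma * ||u - v||^2) for vectors given as sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sigma * squared_distance)

def gram_matrix(points, kernel):
    """Matrix of all kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(gram, alpha):
    """sum_i sum_j alpha_i alpha_j K(x_i, x_j)."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * gram[i][j]
               for i in range(n) for j in range(n))

# illustrative points of R^2 (any finite set of observations could be used)
points = [(0.0, 0.0), (1.0, 0.5), (-0.3, 2.0), (1.5, -1.0)]
gram = gram_matrix(points, gaussian_kernel)

rng = random.Random(0)
values = [
    quadratic_form(gram, [rng.uniform(-1, 1) for _ in points])
    for _ in range(200)
]
is_symmetric = all(abs(gram[i][j] - gram[j][i]) < 1e-12
                   for i in range(len(points)) for j in range(len(points)))
print(is_symmetric, min(values) >= -1e-12)
```

The same functions apply unchanged to discretized functions, provided the squared distance is replaced by a suitable approximation of the norm of the functional space.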



3.2 The case of functional data 



The short introduction to SVM proposed in the previous section has clearly
shown that defining linear SVM for data in a functional space is as easy as
for data in $\mathbb{R}^d$, because we only assumed that the input space was a
Hilbert space. By the dual formulation of the optimization problem $(P_C)$, a
software implementation of linear SVM on functional data is even possible, by
relying on numerical quadrature methods to calculate the requested integrals
(inner product in $L^2(\mu)$, cf section 4.3).

However, the functional nature of the data has some effects. It should first be
noted that in infinite dimensional Hilbert spaces, the hard margin problem
$(P_0)$ always has a solution when the input data are in general position,
i.e., when $N$ observations span an $N$ dimensional subspace of $\mathcal{X}$. A
very naive solution would therefore consist in avoiding soft margins and non
linear kernels. This would not give very interesting results in practice
because of the lack of regularization (see for some examples in very high
dimension spaces, as well as section 6.1).

Moreover, the linear SVM with soft margin can also lead to bad performances.
It is indeed well known (see e.g. [20]) that problem $(P_C)$ is equivalent to
the following unconstrained optimization problem:

$$(R_\lambda) \quad \min_{w, b} \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i(\langle w, x_i \rangle + b)) + \lambda \langle w, w \rangle,$$

with $\lambda = \frac{1}{CN}$. This way of viewing $(P_C)$ emphasizes the
regularization aspect (see also [37, 38, 12]) and links the SVM model to ridge
regression [21]. As shown in p/7[, the penalization used in ridge regression
behaves poorly with functional data. Of course, the loss function used by SVM
(the hinge loss, i.e., $h(u, v) = \max(0, 1 - uv)$) is different from the
quadratic loss used in ridge regression and therefore no conclusion can be
drawn from the experiments reported there. However, they show that we might
expect bad performances with the linear SVM applied directly to functional
data. We will see in sections 6.1 and 6.2 that the efficiency of the ridge
regularization seems to be linked with the actual dimension of the data: it
does not behave very well when the number of discretization points is very big,
which leads to approximating the ridge penalty by a dot product in a very high
dimensional space (see also section 4.3).
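The link between the soft margin problem and its regularized form can be checked numerically. The sketch below assumes finite dimensional inputs and the scaling $\lambda = 1/(CN)$; the data, the candidate pairs $(w, b)$ and the function names are illustrative assumptions of ours. For a fixed $(w, b)$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, so the soft margin objective equals $CN$ times the regularized objective and both problems share the same minimizers.

```python
def inner(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def hinge_losses(w, b, xs, ys):
    """Smallest feasible slack variables for a fixed pair (w, b)."""
    return [max(0.0, 1.0 - y * (inner(w, x) + b)) for x, y in zip(xs, ys)]

def soft_margin_objective(w, b, xs, ys, C):
    """Objective of (P_C) with each slack at its smallest feasible value."""
    return inner(w, w) + C * sum(hinge_losses(w, b, xs, ys))

def regularized_objective(w, b, xs, ys, lam):
    """Objective of (R_lambda): mean hinge loss plus ridge penalty."""
    return sum(hinge_losses(w, b, xs, ys)) / len(xs) + lam * inner(w, w)

# illustrative data and candidate classifiers (hypothetical values)
xs = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5), (-0.5, 0.2)]
ys = [1, 1, -1, -1]
C = 10.0
N = len(xs)
lam = 1.0 / (C * N)
candidates = [((0.5, 0.5), 0.0), ((1.0, -0.2), 0.3), ((0.1, 0.0), -1.0)]

pairs = [
    (soft_margin_objective(w, b, xs, ys, C),
     C * N * regularized_objective(w, b, xs, ys, lam))
    for w, b in candidates
]
print(pairs)
```

Each pair printed by the script contains the same value twice, up to rounding errors.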

It is therefore interesting to consider non linear SVM for functional data, by
introducing adapted kernels. As pointed out in e.g. [12], $(P_{C,\mathcal{H}})$
is equivalent to

$$(R_{\lambda,\mathcal{H}}) \quad \min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i f(x_i)) + \lambda \langle f, f \rangle_{\mathcal{H}}.$$

Using a kernel therefore amounts both to replacing a linear classifier by a non
linear one and to replacing the ridge penalization by a penalization induced by
the kernel, which might be better adapted to the problem (see [38] for links
between regularization operators and kernels). The applications presented in
section 6 illustrate this fact.



4 Kernels for FDA 



4.1 Classical kernels



Many standard kernels for $\mathbb{R}^d$ data are based on the Hilbert structure
of $\mathbb{R}^d$ and can therefore be applied to any Hilbert space. This is the
case for instance of the Gaussian kernel (based on the norm in $\mathcal{X}$:
$K(u, v) = e^{-\sigma \|u - v\|^2}$) and of the polynomial kernels (based on the
inner product in $\mathcal{X}$: $K(u, v) = (1 + \langle u, v \rangle)^D$).
Obviously, the only practical difficulty consists in implementing the
calculations needed in $\mathcal{X}$ so as to evaluate the chosen kernel (the
problem also appears for the plain linear "kernel", i.e. when no feature
mapping is done). Section 4.3 discusses this point.



4.2 Using the functional nature of the data



While the functional versions of the standard kernels can provide an
interesting library of kernels, they do not take advantage of the functional
nature of the data (they use only the Hilbert structure of $L^2(\mu)$). Kernels
that use the fact that we are dealing with functions are nevertheless quite
easy to define.

A standard method consists in introducing kernels that are made by a
composition of a simple feature map with a standard kernel. More formally, we
use a transformation operator $P$ from $\mathcal{X}$ to another space
$\mathcal{V}$ on which a kernel $K$ is defined. The actual kernel $Q$ on
$\mathcal{X}$ is defined as $Q(u, v) = K(P(u), P(v))$ (if $K$ is a kernel, then
so is $Q$).



4.2.1 Functional transformations

In some application domains, such as chemometrics, it is well known that the
shape of a spectrum (which is a function) is sometimes more important than
its actual mean value. Several transformations can be proposed to deal with
this kind of data. For instance, if $\mu$ is a finite measure (i.e.,
$\mu(\mathbb{R}) < \infty$), a centering transformation can be defined as the
following mapping from $L^2(\mu)$ to itself:

$$C(u) = u - \frac{1}{\mu(\mathbb{R})} \int u \, d\mu.$$

A normalization mapping can also be defined.

If the functions are smooth enough, i.e., if we restrict ourselves to a Sobolev
space $W^{s,2}$, then some derivative transformations can be used: the Sobolev
space $W^{s,2}$, also denoted $H^s$, is the Hilbert space of functions which
have derivatives up to the order $s$ (in the sense of the distribution theory).
For instance, with $s \geq 2$, we can use the second derivative, which makes it
possible to focus on the curvature of the functions: this is particularly
useful in near infrared spectrometry (see e.g., [31, 33], and section 6.3).
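As an illustration of how such transformations combine with a standard kernel, the sketch below builds kernels of the form $Q(u, v) = K(P(u), P(v))$ on regularly sampled curves. It rests on assumptions of ours: the measure is the Lebesgue measure on $[0, 1]$, integrals are approximated by the trapezoidal rule, the second derivative is replaced by central finite differences (rather than a spline based derivative), and all function names are hypothetical.

```python
import math

def trapezoid(values, step):
    """Trapezoidal approximation of an integral on a regular grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def center(values, step):
    """C(u) = u - (1 / mu(I)) * integral of u, with mu the Lebesgue measure on I."""
    length = step * (len(values) - 1)
    mean = trapezoid(values, step) / length
    return [v - mean for v in values]

def second_derivative(values, step):
    """Central finite differences; the two end points are dropped."""
    return [(values[k - 1] - 2.0 * values[k] + values[k + 1]) / step ** 2
            for k in range(1, len(values) - 1)]

def gaussian_kernel(u, v, step, sigma=1.0):
    """K(u, v) = exp(-sigma * ||u - v||^2), the L2 norm being approximated by quadrature."""
    squared_diff = [(a - b) ** 2 for a, b in zip(u, v)]
    return math.exp(-sigma * trapezoid(squared_diff, step))

def compose(kernel, transform):
    """Q(u, v) = K(P(u), P(v)) for a transformation P and a kernel K."""
    def composed(u, v, step):
        return kernel(transform(u, step), transform(v, step), step)
    return composed

step = 0.01
grid = [k * step for k in range(101)]
u = [math.sin(2.0 * math.pi * t) for t in grid]
v = [value + 3.0 for value in u]          # same shape, different mean level

plain_value = gaussian_kernel(u, v, step)
centered_value = compose(gaussian_kernel, center)(u, v, step)
curvature_value = compose(gaussian_kernel, second_derivative)(u, v, step)
print(plain_value, centered_value, curvature_value)
```

The two curves differ only by a vertical shift: the plain Gaussian kernel considers them as far apart, whereas the kernels built on the centering or on the second derivative consider them as essentially identical.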



4.2.2 Projections

Another type of transformation can be used in order to define adapted kernels.
The idea is to reduce the dimensionality of the input space, that is to apply
the standard filtering approach of FDA. We assume given a $d$-dimensional
subspace $V_d$ of $\mathcal{X}$ and an orthonormal basis of this space denoted
$\{\Psi_j\}_{j=1,\ldots,d}$. We define the transformation $P_{V_d}$ as the
orthogonal projection on $V_d$,

$$P_{V_d}(x) = \sum_{j=1}^{d} \langle x, \Psi_j \rangle \Psi_j.$$

$(V_d, \langle \cdot, \cdot \rangle_{\mathcal{X}})$ is isomorphic to
$(\mathbb{R}^d, \langle \cdot, \cdot \rangle_{\mathbb{R}^d})$ and therefore one
can use a standard SVM on the vector data
$(\langle x, \Psi_1 \rangle, \ldots, \langle x, \Psi_d \rangle)$. This means
that $K$ can be any kernel adapted to vector data. In the case where $K$ is the
usual dot product of $\mathbb{R}^d$, this kernel is known as the empirical
kernel map (see [43] for further details in the field of protein analysis).

Obviously, this approach is not restricted to functional data, but the choice
of $V_d$ can be directed by expert knowledge on the considered functions and
we can then consider that it takes advantage of the functional nature of the
data. We outline here two possible solutions based on orthogonal bases and on
B-spline bases.

If $\mathcal{X}$ is separable, it has a Hilbert basis, i.e., a complete
orthonormal system $\{\Psi_j\}_{j \geq 1}$. Therefore one can define $V_d$ as
the space spanned by $\{\Psi_j\}_{j=1,\ldots,d}$. The choice of the basis can be
based on expert considerations. Good candidates include Fourier bases and
wavelet bases. If the signal is known to be non stationary, a wavelet based
representation might for instance give better results than a Fourier
representation. Once the basis is chosen, an optimal value for $d$ can be
derived from the data, as explained in section 5, in such a way that the
obtained SVM has some consistency properties. Moreover, this projection
approach gives good results in practice (see section 6.1).
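A minimal sketch of the projection step is given below for a Fourier basis. The setting is an assumption made for illustration: functions are defined on $[0, 1]$ with the Lebesgue measure, they are sampled on a regular grid, the inner products $\langle x, \Psi_j \rangle$ are approximated by the trapezoidal rule, and the names `fourier_basis` and `project` are hypothetical. The resulting coordinate vector is what a standard SVM would receive as input.

```python
import math

def fourier_basis(j, t):
    """j-th element (j >= 1) of the orthonormal Fourier basis of L^2([0, 1])."""
    if j == 1:
        return 1.0
    m = j // 2
    if j % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * m * t)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * m * t)

def project(samples, grid, d):
    """Coordinates (<x, Psi_1>, ..., <x, Psi_d>) approximated by the trapezoidal rule."""
    step = grid[1] - grid[0]              # regular sampling is assumed
    coordinates = []
    for j in range(1, d + 1):
        products = [x * fourier_basis(j, t) for x, t in zip(samples, grid)]
        integral = step * (sum(products) - 0.5 * (products[0] + products[-1]))
        coordinates.append(integral)
    return coordinates

# illustrative curve: x(t) = 2 + 3 * sqrt(2) * sin(2 pi t)
grid = [k / 200 for k in range(201)]
samples = [2.0 + 3.0 * math.sqrt(2.0) * math.sin(2.0 * math.pi * t) for t in grid]
coordinates = project(samples, grid, d=5)
print([round(c, 6) for c in coordinates])
```

For this curve, only the coordinates on the constant function and on the first sine function are non zero (2 and 3 respectively), which matches the way the curve was built.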

Another solution is to choose a projection space that has interesting practical
properties, for instance a spline space with its associated B-spline bases. The
regularity of spline functions can be chosen a priori so as to enforce expert
knowledge on the functions. For instance, near infrared spectra are smooth
because of the physical properties of the light transmission (and reflection).
By using a spline representation of the spectra, we replace original
unconstrained observations by $C^k$ approximations ($k$ depends on what kind of
smoothness hypothesis can be made). This projection can also be combined with a
derivative transformation operation (as proposed in section 4.2.1).



4.3 Functional data in practice



In practice, the functions $(x_i)_{1 \leq i \leq N}$ are never perfectly known.
It is therefore difficult to implement exactly the functional kernels described
in this section.

The best situation is the one in which $d$ discretization points have been
chosen in $\mathbb{R}$, $(t_k)_{1 \leq k \leq d}$, and each function $x_i$ is
described by a vector of $\mathbb{R}^d$, $(x_i(t_1), \ldots, x_i(t_d))$. In this
situation, a simple solution consists in assuming that standard operations in
$\mathbb{R}^d$ (linear combinations, inner product and norm) are good
approximations of their counterparts in the considered functional space. When
the sampling is regular, this is equivalent to applying standard SVMs to the
vector representation of the functions (see section 6 for real world examples
of this situation). When the sampling is not regular, integrals should be
approximated thanks to a quadrature method that will take into account the
relative position of the sampling points.
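The irregular sampling case can be sketched as follows. The trapezoidal rule is used here as one possible quadrature method (the text does not prescribe a particular one), the measure is assumed to be the Lebesgue measure, and the function names and sampling points are illustrative assumptions.

```python
import math

def trapezoid_weights(points):
    """Trapezoidal quadrature weights for sorted, possibly irregular, sampling points."""
    weights = [0.0] * len(points)
    for k in range(len(points) - 1):
        half_gap = 0.5 * (points[k + 1] - points[k])
        weights[k] += half_gap
        weights[k + 1] += half_gap
    return weights

def inner_product(u, v, weights):
    """Quadrature approximation of the integral of u * v."""
    return sum(w * a * b for w, a, b in zip(weights, u, v))

def gaussian_kernel(u, v, weights, sigma=1.0):
    """K(u, v) = exp(-sigma * ||u - v||^2) with the norm approximated by quadrature."""
    diff = [a - b for a, b in zip(u, v)]
    return math.exp(-sigma * inner_product(diff, diff, weights))

# irregular sampling points in [0, 1] (hypothetical)
points = [0.0, 0.1, 0.15, 0.4, 0.7, 1.0]
weights = trapezoid_weights(points)

u = [t for t in points]                  # u(t) = t
v = [1.0 for _ in points]                # v(t) = 1

approximation = inner_product(u, v, weights)           # integral of t on [0, 1]
naive_dot_product = sum(a * b for a, b in zip(u, v))   # ignores point positions
print(approximation, naive_dot_product, gaussian_kernel(u, v, weights))
```

Here the quadrature recovers the exact value $\int_0^1 t \, dt = 1/2$ (the integrand is linear), whereas the plain dot product of the sampled vectors depends on where the sampling points happen to be concentrated.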



In some application domains, especially medical ones (e.g., [22]), the situation
is not as good. Each function is in general badly sampled: the number and
the location of discretization points depend on the function and therefore a
simple vector model is no longer possible. A possible solution in this context
consists in constructing an approximation of $x_i$ based on its observation
values (thanks to e.g., B-splines) and then in working with the reconstructed
functions (see [33] for details).



The function approximation tool used should be simple enough to allow easy
implementation of the requested operations. This is the case for instance for
B-splines, which allow in addition derivative calculations and an easy
implementation of the kernels described in section 4.2.1. It should be noted
that spline approximation is different from projection on a spline subspace.
Indeed each sampled function could be approximated on a different B-spline
basis, whereas the projection operator proposed in section 4.2.2 requires a
unique projection space and therefore the same B-spline basis for each input
function. In other words, the spline approximation is a convenient way of
representing functions (see section 6.3 for an application to real world data),
whereas the spline projection corresponds to a data reduction technique. Both
aspects can be combined.



5 Consistency of functional SVM 



5.1 Introduction



In this section we study one of the functional kernels described above and
show that it can be used to define a consistent classifier for functional data.
We first introduce some notations and definitions.

Our goal is to define a training procedure for functional SVM such that the
asymptotic generalization performance of the constructed model is optimal.
We define as usual the generalization error of a classifier $f$ by the
probability of misclassification:

$$Lf = P(f(X) \neq Y).$$

The minimal generalization error is the Bayes error achieved by the optimal
classifier $f^*$ given by

$$f^*(x) = \begin{cases} 1 & \text{when } P(Y = 1 \mid X = x) > 1/2 \\ -1 & \text{otherwise.} \end{cases}$$
We denote $L^* = Lf^*$ the optimal Bayes error. Of course, the closer the error
of a classifier is to $L^*$, the better its generalization ability is.

Suppose that we are given a learning sample of size $N$ defined as in section
3.1. A learning procedure is an algorithm which allows the construction, from
this learning sample, of a classification rule $f_N$ chosen in a set of
admissible classifiers. This algorithm is said to be consistent if

$$Lf_N \xrightarrow{N \to +\infty} L^*.$$

It should be noted that when the data belong to $\mathbb{R}^d$, SVMs don't
always provide consistent classifiers. Some sufficient conditions have been
given in [40]: the input data must belong to a compact subset of
$\mathbb{R}^d$, the regularization parameter ($C$ in $(P_{C,\mathcal{H}})$) has
to be chosen in a specific way (in relation to $N$ and to the type of kernel
used) and the kernel must be universal [39]. If $\phi$ is the feature map
associated to a kernel $K$, the kernel is universal if the set of all the
functions of the form $x \mapsto \langle w, \phi(x) \rangle$ for
$w \in \mathcal{H}$ is dense in the set of all continuous functions defined on
the considered compact subset. In particular, the Gaussian kernel with any
$\sigma > 0$ is universal for all compact subsets of $\mathbb{R}^d$ (see [40]
for further details and the proof of Theorem 1 for the precise statement on
$C$).



5.2 A learning algorithm for functional SVM 

The general methodology proposed in [3] makes it possible to turn (with some
adaptations) a consistent algorithm for data in $\mathbb{R}^d$ into a consistent
algorithm for data in $\mathcal{X}$, a separable Hilbert space. We describe in
this section the adapted algorithm based on SVM.

The methodology proposed in [3] is based on projection operators described
in section 4.2.2, more precisely on the usage of a Hilbert basis of
$\mathcal{X}$. In order to build an SVM classifier based on $N$ examples, one
needs to choose from the data several parameters (in addition to the weights
$\{\alpha_i\}_{1 \leq i \leq N}$ and $b$ in problem $(D_{C,\mathcal{H}})$):

(1) the projection size parameter $d$, i.e., the dimension of the subspace
    $V_d$ on which the functions are projected before being submitted to the
    SVM (recall that $V_d$ is the space spanned by
    $\{\Psi_j\}_{j=1,\ldots,d}$);

(2) $C$, the regularization parameter;

(3) the fully specified kernel $K$, that is the type of the universal kernel
    (Gaussian, exponential, etc.) but also the parameter of this kernel such as
    $\sigma$ for the Gaussian kernel $K(u, v) = e^{-\sigma \|u - v\|^2}$.

Let us denote $A$ the set of lists of parameters to explore (see section 5.3 for
practical examples). Following [3], we use a validation approach to choose the
best list of parameters $a \in A$ and in fact the best classifier on the
validation set.

The data are split into two sets: a training set
$\{(x_i, y_i), i = 1, \ldots, l_N\}$ and a validation set
$\{(x_i, y_i), i = l_N + 1, \ldots, N\}$. For each fixed list $a$ of parameters,
the training set $\{(x_i, y_i), i = 1, \ldots, l_N\}$ is used to calculate the
SVM classification rule
$f_a(x) = \mathrm{sign}\left(\sum_{i=1}^{l_N} \alpha_i^* y_i K(P_{V_d}(x), P_{V_d}(x_i)) + b^*\right)$
where $(\{\alpha_i^*\}_{1 \leq i \leq l_N}, b^*)$ is the solution of
$(D_{C,\mathcal{H}})$ applied to the projected data
$\{P_{V_d}(x_i), i = 1, \ldots, l_N\}$ (please note that everything should be
indexed by $a$, for instance one should write $K_a$ rather than $K$).

The validation set is used to select the optimal value of $a$ in $A$, $a^*$,
according to an estimation of the generalization error based on a penalized
empirical error, that is, we define

$$a^* = \arg\min_{a \in A} \; \hat{L} f_a + \frac{\lambda_a}{\sqrt{N - l_N}},$$

where

$$\hat{L} f_a = \frac{1}{N - l_N} \sum_{i = l_N + 1}^{N} \mathbb{1}_{\{f_a(x_i) \neq y_i\}},$$

and $\lambda_a$ is a penalty term used to avoid the selection of the most
complex models (i.e., the ones with the highest $d$ in general). The classifier
$f_N$ is then chosen as $f_N = f_{a^*}$.
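The selection step can be sketched as a simple loop over candidate parameter lists. The code below is an illustration under several assumptions of ours and is not the procedure used in the experiments: the penalty is taken as a user supplied value divided by the square root of the validation set size, a parameter list is reduced to a projection dimension `d`, the inputs are assumed to be already given as projection coordinates, and the trainer is a nearest class mean rule used as a stand-in for an SVM solver. All names (`select`, `train_nearest_mean`, `penalty`) are hypothetical.

```python
import math

def empirical_error(classifier, validation):
    """Proportion of misclassified examples on the validation set."""
    return sum(1 for x, y in validation if classifier(x) != y) / len(validation)

def select(candidates, train, penalty, training, validation):
    """Return (score, a, f_a) minimizing the penalized validation error."""
    best = None
    for a in candidates:
        f_a = train(a, training)
        # assumed penalty form: penalty(a) / sqrt(size of the validation set)
        score = empirical_error(f_a, validation) + penalty(a) / math.sqrt(len(validation))
        if best is None or score < best[0]:
            best = (score, a, f_a)
    return best

def train_nearest_mean(a, training):
    """Stand-in trainer (NOT an SVM): nearest class mean on the first d coordinates."""
    d = a["d"]
    def class_mean(label):
        rows = [x[:d] for x, y in training if y == label]
        return [sum(column) / len(rows) for column in zip(*rows)]
    mean_pos, mean_neg = class_mean(1), class_mean(-1)
    def classify(x):
        dist_pos = sum((c - m) ** 2 for c, m in zip(x[:d], mean_pos))
        dist_neg = sum((c - m) ** 2 for c, m in zip(x[:d], mean_neg))
        return 1 if dist_pos <= dist_neg else -1
    return classify

# toy coordinates: only the second coordinate carries class information
training = [((0.1, 1.0, 0.2), 1), ((-0.1, 0.8, -0.2), 1),
            ((0.1, -1.0, 0.2), -1), ((-0.1, -0.9, -0.2), -1)]
validation = [((0.3, 0.7, 0.0), 1), ((-0.2, 1.1, 0.1), 1),
              ((0.0, -0.8, -0.1), -1), ((0.2, -1.2, 0.3), -1)]

candidates = [{"d": 1}, {"d": 2}, {"d": 3}]
best_score, best_a, best_classifier = select(
    candidates, train_nearest_mean, lambda a: 0.1 * a["d"], training, validation)
print(best_a, best_score)
```

On this toy example the dimension 1 is rejected because of its validation error, and the dimension 3 is rejected because of its larger penalty, which illustrates the role of the penalty term in avoiding the most complex models.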



5.3 Consistency 



Under some conditions on $A$, the algorithm proposed in the previous section
is consistent. We assume given a fixed Hilbert basis of the separable Hilbert
space $\mathcal{X}$, $\{\Psi_j\}_{j \geq 1}$. When the dimension of the
projection space $V_d$ is chosen, a fully specified kernel $K$ has to be chosen
in a finite set of kernels, $\mathcal{J}_d$. The regularization parameter $C$
can be chosen in a bounded interval of the form $[0, C_d]$, for instance thanks
to the algorithm proposed in [19] that makes it possible to calculate the
validation performances for all values of $C$ in a finite time. Therefore, the
set $A$ can be written
$\bigcup_{d \geq 1} \{d\} \times \mathcal{J}_d \times [0, C_d]$. An element of
$A$ is a triple $a = (d, K, C)$ that specifies the projection operator
$P_{V_d}$, the kernel $K$ (including all its parameters) and the regularization
constant $C$.

Let us first define, for all $\epsilon > 0$, $\mathcal{N}(\mathcal{H}, \epsilon)$
the covering number of the Hilbert space $\mathcal{H}$ which is the minimum
number of balls with radius $\epsilon$ that are needed to cover the whole space
$\mathcal{H}$ (see e.g., chapter 28 of [11]). Note that in SVM, as $\mathcal{H}$
is induced by a kernel $K$, this number is closely related to the kernel (in
particular because the norm used to define the balls is induced by the inner
product of $\mathcal{H}$, that is by $K$ itself); in this case, we will then
denote the covering number $\mathcal{N}(K, \epsilon)$. For example, Gaussian
kernels are known to induce feature spaces with covering number of the form
$O(\epsilon^{-d})$ where $d$ is the dimension of the input space (see [40]).

Then we have: 

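Theorem 1 We assume that $X$ takes its values in a bounded subspace of the
separable Hilbert space $\mathcal{X}$. We suppose that,

$\forall d \geq 1$, $\mathcal{J}_d$ is a finite set,

$\exists K_d \in \mathcal{J}_d$ such that: $K_d$ is universal and
$\exists \nu_d > 0$: $\mathcal{N}(K_d, \epsilon) = O(\epsilon^{-\nu_d})$,

$C_d > 1$,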

and that 



and finally that 



d>l 



$\lim_{N \to +\infty} l_N = +\infty$, $\quad \lim_{N \to +\infty} N - l_N = +\infty$,



lim — ; = 0. 

Af^+oo N -In 

Then, the functional SVM $f_N = f_{a^*}$ chosen as described in section 5.2
(where $a^*$ is optimal in
$A = \bigcup_{d \geq 1} \{d\} \times \mathcal{J}_d \times [0, C_d]$) is
consistent, that is:

$$Lf_N \xrightarrow{N \to +\infty} L^*.$$



The proof of this result is given in Appendix A. It is close to the proof given
in [3] except that in [3] the proof follows from an oracle inequality given for
a finite grid search model. The grid search is adapted to the classifier used
in that paper (a $k$-nearest neighbor method), but not to our setting. Our
result includes the search for a parameter $C$ which can belong to an infinite
and non countable set; this can be done by the use of the shatter coefficient
of a particular class of linear classifiers which provides the behavior of the
classification rule on a set of $N - l_N$ observations (see [11]).

As pointed out before, the Gaussian kernel satisfies the hypothesis of the
theorem. Therefore, if $\mathcal{J}_d$ contains a Gaussian kernel for all $d$,
then consistency of the whole procedure is guaranteed. Other non universal
kernels can of course be included in the search for the optimal model.

Remark 1 Note that, in this theorem, the sets $\mathcal{J}_d$ and $[0, C_d]$ depend on $d$: this does not influence the consistency of the method. In fact, one could have chosen the same set for every $d$, and $\mathcal{J}_d$ could also contain a single Gaussian kernel with any parameter $\sigma > 0$. In practice however, this additional flexibility is very useful to adapt the model to the data, for instance by choosing on the validation set an optimal value for $\sigma$ with a Gaussian kernel.



6 Applications 

We present, in this section, several applications of the functional SVM models described before to real world data. The first two applications illustrate the consistent methodology introduced in section 5.2: the first has an input variable with a high number of discretization points, and the second has far fewer discretization points. These applications show that more benefits are obtained from the functional approach when the data can reasonably be considered as functions, that is, when the number of discretization points is higher than the number of observations.

The last application deals with spectrometric data and shows how a functional transformation (derivative calculation) can improve the efficiency of SVMs. For this application, we do not use the consistent methodology but a projection on a spline space that permits easy derivative calculations.

For simplicity, the parameter $C$ is chosen among a finite set of values (in general fewer than 10 values) growing exponentially (for instance 0.1, 1, 10, ...). In each simulation, the kernel family is fixed (e.g., Gaussian kernels). A finite set of fully specified candidate kernels is chosen in this family (for instance approximately 10 values of $\sigma$ in the case of the Gaussian kernel family) and the best kernel is selected as described in the previous section.
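The candidate grid described above can be sketched as follows. This is a minimal illustration and not the authors' code: the grid bounds and the names `c_grid`, `sigma_grid` and `candidates` are assumptions chosen only to match the orders of magnitude quoted in the text.

```python
from itertools import product

# Assumed grids: the text only states that fewer than 10 exponentially growing
# values are used for C (0.1, 1, 10, ...) and approximately 10 values for the
# Gaussian kernel parameter.
c_grid = [10.0 ** e for e in range(-1, 5)]      # 0.1, 1, 10, 100, 1000, 10000
sigma_grid = [10.0 ** e for e in range(-5, 5)]  # hypothetical range, 10 values

# Every fully specified model (kernel parameter, C) to score on the validation set.
candidates = list(product(sigma_grid, c_grid))
print(len(candidates))  # number of (sigma, C) pairs to evaluate
```

Each pair in `candidates` would then be trained on the training part of the split and scored on the validation part.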

6.1 Speech recognition 

We first illustrate in this section the consistent learning procedure given in section 5. We compare it to the original procedure based on k-nn described in [3]. In practice, the only difference between the approaches is that we use an SVM whereas [3] uses k-nn.

The problems considered in [3] consist in classifying speech samples^1. There are three problems with two classes each: classifying "yes" against "no", "boat" against "goat" and "sh" against "ao". For each problem, we have 100 functions. Table 1 gives the sizes of the classes for each problem.

^1 Data are available at http://www.math.univ-montp2.fr/~biau/bbwdata.tgz

Problem      Class 1   Class -1
yes/no       48        52
boat/goat    55        45
sh/ao        42        58

Table 1. Sizes of the classes

Each function is described by a vector in $\mathbb{R}^{8192}$ which corresponds to a digitized speech frame. The goal of this benchmark is to compare data processing methods that make minimal assumptions on the data: no prior knowledge is used to preprocess the data.

In order to directly compare to results from [3], performances of the algorithms are assessed by a leave-one-out procedure: 99 functions are used as the learning set (to which the split sample procedure is applied to choose the SVM) and the remaining function provides a test example.

While the procedure described in section 5.2 makes it possible to choose most of the parameters, both the basis $\{\Psi_j\}_{j \geq 1}$ and the penalty term $\lambda_d$ can be freely chosen. To focus on the improvement provided by SVM over k-nn, we have used the same elements as [3]. As the data are temporal patterns, [3] relies on the Fourier basis (moreover, the Fast Fourier Transform allows an efficient calculation of the coordinates of the data on the basis). The penalty term $\lambda_d$ is 0 for all d below 100 and a high value (for instance 1000) for d > 100. This makes it possible to evaluate the models only for d < 100, because the high value of $\lambda_d$ for higher d prevents the corresponding models from being chosen, regardless of their performances. As pointed out in [3], this choice appears to be safe as most of the dimensions then selected are much smaller than 50.
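The effect of such a penalty can be sketched in a few lines of Python. The sketch assumes that, as suggested by the $\lambda_d/\sqrt{m}$ term of Appendix A, a model of dimension d is scored by its validation error plus $\lambda_d/\sqrt{m}$, with m the number of validation examples. The function names, the zero penalty below the threshold and the error values are illustrative assumptions, not taken from the experiments.

```python
import math

def penalty(d, threshold=100, high=1000.0):
    """Assumed penalty: zero up to the threshold, a large constant above it."""
    return 0.0 if d <= threshold else high

def select_dimension(validation_errors, m):
    """Pick the dimension minimizing validation error + penalty(d) / sqrt(m).

    validation_errors maps a projection dimension d to the error rate
    measured on the m validation examples (hypothetical values below).
    """
    return min(validation_errors,
               key=lambda d: validation_errors[d] + penalty(d) / math.sqrt(m))

# Even a perfect validation score cannot compensate the penalty for d > 100.
errors = {10: 0.20, 50: 0.10, 150: 0.0}
print(select_dimension(errors, m=49))  # 50
```

With such a penalty, models with d above the threshold never need to be trained, since their score is always worse than that of any model below it.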

The last free parameter is the split between the training set and the validation set. As in [3], we have used the first 50 examples for training and the remaining 49 for validation. We report the error rate for each problem and several methods in tables 2 and 3.
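The evaluation protocol of this section (leave-one-out on 100 functions, with a 50/49 split of the 99 remaining functions) can be sketched as follows. Only the index bookkeeping is shown; the function names are hypothetical and the model fitting itself is omitted.

```python
def leave_one_out(n):
    """Yield (learning_indices, test_index) for each of the n observations."""
    for test_index in range(n):
        yield [i for i in range(n) if i != test_index], test_index

def split_sample(learning, n_train=50):
    """First n_train learning examples for training, the rest for validation."""
    return learning[:n_train], learning[n_train:]

folds = list(leave_one_out(100))
train, validation = split_sample(folds[0][0])
print(len(folds), len(train), len(validation))  # 100 50 49
```

For each fold, the model parameters would be chosen on `validation` and the selected model tested on the single left-out function.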



Problem      k-nn   QDA
yes/no       10%    7%
boat/goat    21%    35%
sh/ao        16%    19%

Table 2. Error rate for reference methods (leave-one-out)

Table 2 has been reproduced from [3]. QDA corresponds to Quadratic Discriminant Analysis performed, as for k-nn, on the projection of the data onto a finite dimensional subspace induced by the Fourier basis. Table 3 gives results obtained with SVMs. The second column, "linear (direct)", corresponds to the direct application of the procedure described in section 3.1.2, without any prior projection. This is in fact the plain linear SVM directly applied to the original data. The two other columns correspond to the SVM applied to the projected data, as described in section 5.2.

Problem/Kernel   linear (direct)   linear (projection)   Gaussian (projection)
yes/no           58%               19%                   10%
boat/goat        46%               29%                   8%
sh/ao            47%               25%                   12%

Table 3. Error rate for SVM based methods (leave-one-out)

The most obvious fact is that the plain linear kernel gives very poor performances, especially compared to the functional kernels on projections: its results are sometimes worse than the rule that assigns every observation to the dominant class. This shows that the ridge regularization of problem $(R_\lambda)$ is not adapted to functional data, a fact that was already known in the context of linear discriminant analysis [13]. The projection operator improves the results of the linear kernel, but not enough to reach the performance levels of k-nn. It therefore seems that the projected problem is non linear.
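The comparison with the dominant class rule can be made concrete from the class sizes of Table 1 and the "linear (direct)" column of Table 3. The short sketch below computes the error of the rule that always predicts the larger class on the full sample; the function name is hypothetical and the full-sample computation is a simplification of the leave-one-out protocol.

```python
# Class sizes from Table 1 and "linear (direct)" error rates from Table 3.
class_sizes = {"yes/no": (48, 52), "boat/goat": (55, 45), "sh/ao": (42, 58)}
linear_direct_error = {"yes/no": 0.58, "boat/goat": 0.46, "sh/ao": 0.47}

def majority_rule_error(n_pos, n_neg):
    """Error rate of always predicting the larger of the two classes."""
    return min(n_pos, n_neg) / (n_pos + n_neg)

for problem, sizes in class_sizes.items():
    baseline = majority_rule_error(*sizes)
    # True when the plain linear SVM does worse than the dominant class rule.
    print(problem, baseline, linear_direct_error[problem] > baseline)
```

The baseline errors are 0.48, 0.45 and 0.42 respectively, which puts the reported error rates of the plain linear SVM in perspective.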

As expected, the functional Gaussian SVM performs generally better than k-nn and QDA, but the training times of the methods are not comparable. On a mid range personal computer, the full leave-one-out evaluation procedure applied to the Gaussian SVM takes approximately one and a half hours (using LIBSVM jsl embedded in the package e1071 of the R software [28]), whereas the same procedure takes only a few minutes for k-nn and QDA.

The performances of SVM with Gaussian kernel directly used on the raw data (in $\mathbb{R}^{8192}$) are not reported here as they are quite meaningless. The results are indeed extremely sensitive to the way the grid search is conducted, especially for the value of $C$, the regularization parameter. On the "yes/no" data set for instance, if the search grid for $C$ contains only values of at least 1, then the leave-one-out procedure gives an error rate of 19%, but in each case the value $C = 1$ is selected on the validation set. When the grid search is extended to smaller values, the smallest value is always selected and the error rate increases up to 46%. Similar behaviors occur for the other data sets. On this benchmark, the performances therefore depend on the choice of the search grid for $C$. This is neither the case for the linear kernel on raw data, nor for the projection based kernels. This is not very surprising, as Gaussian kernels have some locality problems in very high dimensional spaces (see [15]) that make them difficult to use.

6.2 Using wavelet basis



In order to investigate the limitation of the direct use of linear SVMs, we have applied them to another speech recognition problem. We studied a part of the TIMIT database^2 which was used in . The data are log-periodograms corresponding to recordings of phonemes of 32 ms duration (the length of each log-periodogram is 256). We have chosen to restrict ourselves to classifying "aa" against "ao", because this is the most difficult sub-problem in the database. The database is a multi-speaker database. There are 325 speakers in the training set and 112 in the test set. We have 519 examples for "aa" in the training set (759 for "ao") and 176 in the test set (263 for "ao"). We use the split sample approach to choose the parameters on the training set (50% of the training examples are used for validation) and we report the classification error on the test set.

Here, we do not use a Fourier basis as the functions are already represented in a frequency form. As the data are very noisy, we decided to use a hierarchical wavelet basis (see e.g., [25]). We used the same penalty term as in section 6.1. The error rate on the test set is reported in table 4. It appears that functional kernels are not as useful here as in the previous example, as a linear SVM applied directly to the discretized functions (in $\mathbb{R}^{256}$) performs as well as a linear SVM on the wavelet coefficients. A natural explanation is that the actual dimension of the input space (256) is smaller than the number of training examples (639), which means that evaluating the optimal coefficients of the SVM is less difficult than in the previous example. Therefore, the additional regularization provided by reducing the dimension with a projection onto a small dimensional space is not really useful in this context.

Functional Gaussian SVM   Functional linear SVM   Linear SVM
22%                       19.4%                   20%

Table 4. Error rate for all methods on the test set
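The figure of 639 training examples follows from the class counts given earlier in this section, as the following arithmetic sketch shows (variable names are illustrative).

```python
n_aa_train, n_ao_train = 519, 759
n_train_total = n_aa_train + n_ao_train   # 1278 examples in the training set
n_fit = n_train_total // 2                # 50% are held out for validation
input_dimension = 256                     # length of each log-periodogram

print(n_fit, input_dimension < n_fit)  # 639 True
```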



6.3 Spectrometric data set 



We study in this section spectrometric data from the food industry^3. Each observation is the near infrared absorbance spectrum of a meat sample (finely chopped), recorded on a Tecator Infratec Food and Feed Analyser (we have 215 spectra). More precisely, an observation consists of a 100 channel spectrum of absorbances in the wavelength range 850-1050 nm (see figure 1). The classification problem consists in separating meat samples with a high fat content (more than 20%) from samples with a low fat content (less than 20%).

^2 Data are available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/phoneme.data
^3 Data are available on statlib at http://lib.stat.cmu.edu/datasets/tecator



[Figure: two panels, "Fat < 20%" and "Fat > 20%"; horizontal axis: wavelength (nm)]

Figure 1. Spectra for both classes

It appears on figure 1 that high fat content spectra sometimes have two local maxima rather than one: we have therefore decided to focus on the curvature of the spectra, i.e., to use the second derivative. Figure 2 shows that there are more differences between the second derivatives of each class than between the original curves.
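A small sketch illustrates why the second derivative isolates curvature. The paper computes derivatives through a spline representation; the code below swaps in a plain central finite difference as a stand-in, and the curves are synthetic examples rather than spectra.

```python
def second_difference(values, step=1.0):
    """Central finite-difference estimate of the second derivative (interior points)."""
    return [(values[i - 1] - 2 * values[i] + values[i + 1]) / step ** 2
            for i in range(1, len(values) - 1)]

grid = range(6)
curve = [x * x for x in grid]                # constant curvature 2
shifted = [x * x + 3 * x + 7 for x in grid]  # same curvature, other level and slope

# The operator removes any constant offset and any linear trend.
print(second_difference(curve) == second_difference(shifted))  # True
```

Two curves that differ only by their mean level or overall slope therefore become identical after this transformation.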



[Figure: two panels, "Fat < 20%" and "Fat > 20%"; horizontal axis: wavelength (nm)]

Figure 2. Second derivatives of the spectra for both classes

The data set is split into 120 spectra for learning and 95 spectra for testing. The problem is used to compare standard kernels (linear and Gaussian kernels) to a derivative based kernel. We do not use the consistent procedure here, as we choose a fixed spline subspace to represent the functions so as to calculate their second derivative. However, the parameters $C$ and $\sigma$ are still chosen by a split sample approach that divides the 120 learning samples into 60 spectra for learning and 60 spectra for validation. The dimension of the spline subspace is obtained thanks to a leave-one-out procedure applied to the whole set of input functions, without taking classes into account (see [33] for details).



The performances depend of course on the random split between learning and test. We have therefore repeated this splitting 250 times (as we do not select an optimal projection dimension, the procedure is much faster than the one used for both previous experiments). Table 5 gives the mean error rate of those experiments on the test set.
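The repeated random splitting can be sketched as a small harness. The function `mean_test_error` and the callback `evaluate` are hypothetical names; a real `evaluate` would fit the SVM on the learning indices (including its internal 60/60 split) and return the error rate on the test indices, whereas the placeholder below only checks the split sizes.

```python
import random

def mean_test_error(n_samples, n_learning, n_repeats, evaluate, seed=0):
    """Average evaluate(learning, test) over random learning/test splits."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    errors = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        learning, test = indices[:n_learning], indices[n_learning:]
        errors.append(evaluate(learning, test))
    return sum(errors) / n_repeats

def dummy_evaluate(learning, test):
    """Placeholder: returns 0.0 when the split has the sizes used in the text."""
    return 0.0 if len(learning) == 120 and len(test) == 95 else 1.0

print(mean_test_error(215, 120, 250, dummy_evaluate))  # 0.0
```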



Kernel                           Mean test error
Linear                           3.38%
Linear on second derivatives     3.28%
Gaussian                         7.5%
Gaussian on second derivatives   2.6%

Table 5. Mean test error rate for all methods

The results show that the problem is less difficult than the previous ones. Nevertheless, it also appears that a functional transformation improves the results: the use of a Gaussian kernel on second derivatives gives significantly better results than the use of a usual kernel (linear or Gaussian) on the original data (according to t-test results). The relatively bad performances of the Gaussian kernel on plain data can be explained by the fact that a direct comparison of spectra based on their $L^2(\mu)$ norm is in general dominated by the mean value of those spectra, which is not a good feature for classification in spectrometric problems. The linear kernel is less sensitive to this problem and is not really improved by the derivative operator. In the Gaussian case, the use of a functional transformation introduces expert knowledge (i.e., curvature is a good feature for some spectrometric problems) and makes it possible to overcome most of the limitations of the original kernel.
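The role of the mean level in a distance based kernel can be illustrated on toy curves. In this sketch the Gaussian kernel is assumed to have the form exp(-sigma * ||u - v||^2), second derivatives are approximated by finite differences instead of splines, and the three curves are invented for the illustration.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gaussian_kernel(u, v, sigma):
    # Assumed parametrization of the Gaussian kernel.
    return math.exp(-sigma * squared_distance(u, v))

def second_difference(values):
    """Finite-difference stand-in for the second derivative (unit step)."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]

bump = [0, 1, 4, 9, 4, 1, 0]        # hypothetical spectrum with one peak
offset = [v + 5 for v in bump]      # same shape, higher mean level
flat = [3] * 7                      # different shape, similar level

# On raw values the flat curve looks closer to the bump than its shifted copy.
print(squared_distance(bump, offset), squared_distance(bump, flat))  # 175 64

# On second differences the shifted copy is identical and the flat curve is far.
d_bump, d_offset, d_flat = map(second_difference, (bump, offset, flat))
print(squared_distance(d_bump, d_offset), squared_distance(d_bump, d_flat))  # 0 116
print(gaussian_kernel(d_bump, d_offset, sigma=0.1))  # 1.0
```

On the raw values the kernel similarity is driven by the vertical offset, whereas after differentiation it reflects the shape of the curves.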



7 Conclusion 



In this paper, we have shown how to use Support Vector Machines (SVMs) for functional data classification. While plain linear SVMs could be used directly on functional data, we have shown the benefits of using adapted functional kernels. We have indeed defined projection based kernels that provide a consistent learning procedure for functional SVMs. We have also introduced transformation based kernels that make it possible to take expert knowledge into account (such as the fact that the curvature of a function can be more discriminant than its values in some applications). Both types of kernels have been tested on real world problems. The experiments gave very satisfactory results and showed that for some types of functional data, the performances of SVM based classification can be improved by using kernels that make use of the functional nature of the data.

A Proofs



In order to simplify the notations, we denote $l = l_N$ when $N$ is obvious. We also denote $X^{(d)} = P_{V_d}(X)$ and $x_i^{(d)} = P_{V_d}(x_i)$.



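The proof of the consistency result of [3] is based on an oracle inequality. We demonstrate a similar inequality: for $N$ large enough,

$$L f_{a^*} - L^* \leq \inf_{d \geq 1} \left[ L_d^* - L^* + \inf_{(C, K) \in [0, C_d] \times \mathcal{J}_d} \left( L f_a - L_d^* \right) + \frac{\lambda_d}{\sqrt{m}} \right] + \sqrt{\frac{32 (l+1) \log m}{m}} + 128 \Delta \sqrt{\frac{1}{32 m (l+1) \log m}} \quad \text{(A.1)}$$

where $m = N - l$, $\Delta = \sum_{d \geq 1} |\mathcal{J}_d| e^{-\lambda_d^2/32} < +\infty$ and $L_d^*$ is the Bayes error for the projected problem, i.e. $L_d^* = \inf_{f: \mathbb{R}^d \to \{-1, 1\}} P(f(X^{(d)}) \neq Y)$.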

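Following [3], we see that the definition of $a^* = (d^*, K^*, C^*)$ leads to

$$\widehat{L} f_{a^*} + \frac{\lambda_{d^*}}{\sqrt{m}} \leq \widehat{L} f_a + \frac{\lambda_d}{\sqrt{m}}$$

for all $a = (d, C, K)$ in $\mathcal{A} = \bigcup_{d \geq 1} \{d\} \times \mathcal{J}_d \times [0, C_d]$, where $\widehat{L} f_a$ denotes the empirical error of $f_a$ on the $m$ validation examples. Then, for all $\epsilon > 0$,

$$P\left( L f_{a^*} - \widehat{L} f_a > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) \leq P\left( L f_{a^*} - \widehat{L} f_{a^*} > \frac{\lambda_{d^*}}{\sqrt{m}} + \epsilon \right) \leq \sum_{d \geq 1} \sum_{K \in \mathcal{J}_d} P\left( L f_{(d, C^*, K)} - \widehat{L} f_{(d, C^*, K)} > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) \quad \text{(A.2)}$$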

In [3], the right part of the inequality is bounded by the use of the union bound on $\mathcal{A}$. Here, $[0, C_d]$ is not countable and thus we cannot do the same. We will then use the generalization capability of a set of linear classifiers via its shatter coefficient. Actually, when $d$ and $K$ are set, $f_{(d, C^*, K)}$ is an affine discrimination function built from the observation projections and the kernel $K$. More precisely, we have: for all $x$ in $\mathcal{X}$,

$$f_a(x^{(d)}) = \sum_{n=1}^{l} \alpha_n^* y_n K(x_n^{(d)}, x^{(d)}) + b^*.$$

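Then, $f_a$ has the form $b + f$ where $f$ is chosen in the set of functions spanned by $\{K(x_1^{(d)}, \cdot), \ldots, K(x_l^{(d)}, \cdot)\}$. Let us denote by $\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})$ this set of classifiers and, for all $f$ in $\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})$, we introduce $L^l f = P(f(X^{(d)}) \neq Y \mid (x_1, y_1), \ldots, (x_l, y_l))$. By Theorem 12.6 in [11] we then have, for all $\nu > 0$,

$$P\left( \sup_{f \in \mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})} |\widehat{L} f - L^l f| > \nu \,\Big|\, (x_1, y_1), \ldots, (x_l, y_l) \right) \leq 8 S(\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)}), m) e^{-m \nu^2/32},$$

where $S(\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)}), m)$ is the shatter coefficient of $\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})$, that is the maximum number of different subsets of $m$ points that can be separated by the set of classifiers $\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})$. This set is a vector space of dimension less than or equal to $l + 1$, therefore according to chapter 13 of [11], $S(\mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)}), m) \leq m^{l+1}$. This implies that, for all $(d, K) \in \mathbb{N}^* \times \mathcal{J}_d$,

$$P\left( L f_{(d, C^*, K)} - \widehat{L} f_{(d, C^*, K)} > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) = E\left[ P\left( L f_{(d, C^*, K)} - \widehat{L} f_{(d, C^*, K)} > \frac{\lambda_d}{\sqrt{m}} + \epsilon \,\Big|\, (x_1, y_1), \ldots, (x_l, y_l) \right) \right]$$
$$\leq E\left[ P\left( \sup_{f \in \mathcal{F}_K(x_1^{(d)}, \ldots, x_l^{(d)})} |\widehat{L} f - L^l f| > \frac{\lambda_d}{\sqrt{m}} + \epsilon \,\Big|\, (x_1, y_1), \ldots, (x_l, y_l) \right) \right] \leq 8 m^{l+1} e^{-m (\lambda_d/\sqrt{m} + \epsilon)^2/32} \leq 8 m^{l+1} e^{-\lambda_d^2/32} e^{-m \epsilon^2/32}. \quad \text{(A.3)}$$

Combining (A.2) and (A.3), we finally see that

$$P\left( L f_{a^*} - \widehat{L} f_a > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) \leq 8 \Delta m^{l+1} e^{-m \epsilon^2/32}.$$

If $Z$ is a real random variable, we have obviously

$$E(Z) \leq E(Z \mathbb{1}_{\{Z > 0\}}) = \int_0^{+\infty} P(Z > \epsilon) \, d\epsilon.$$

For $Z = L f_{a^*} - \widehat{L} f_a - \frac{\lambda_d}{\sqrt{m}}$, this leads, for all $a$ in $\bigcup_{d \geq 1} \{d\} \times \mathcal{J}_d \times [0, C_d]$, to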



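$$L f_{a^*} \leq E(\widehat{L} f_a) + \frac{\lambda_d}{\sqrt{m}} + \int_0^{+\infty} P\left( L f_{a^*} - \widehat{L} f_a > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) d\epsilon.$$

Finally, following [3], for all $u > 0$,

$$\int_0^{+\infty} P\left( L f_{a^*} - \widehat{L} f_a > \frac{\lambda_d}{\sqrt{m}} + \epsilon \right) d\epsilon \leq \int_0^u 1 \, dt + \int_u^{+\infty} 8 \Delta m^{l+1} e^{-m \epsilon^2/32} \, d\epsilon \leq u + 128 \Delta m^{l+1} \int_u^{+\infty} \left( \frac{1}{16} + \frac{1}{m \epsilon^2} \right) e^{-m \epsilon^2/32} \, d\epsilon$$

and then

$$L f_{a^*} \leq E(\widehat{L} f_a) + \frac{\lambda_d}{\sqrt{m}} + u + \frac{128 \Delta m^l}{u} e^{-m u^2/32}.$$

If we set $u = \sqrt{\frac{32 (l+1) \log m}{m}}$ and use the equality $E(\widehat{L} f_a) = L f_a$, we deduce that, for all $a$ in $\mathcal{A}$,

$$L f_{a^*} \leq L f_a + \frac{\lambda_d}{\sqrt{m}} + \sqrt{\frac{32 (l+1) \log m}{m}} + 128 \Delta \sqrt{\frac{1}{32 m (l+1) \log m}}$$

which finally proves the oracle inequality (A.1).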
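The last step relies on the choice $u = \sqrt{32(l+1)\log m / m}$, for which $e^{-m u^2/32} = m^{-(l+1)}$. The following sketch checks numerically that, under this choice, a tail term of the form $128 \Delta m^l e^{-m u^2/32} / u$ equals $128 \Delta / \sqrt{32 m (l+1) \log m}$, the last term of (A.1). The function name and the values of l, m and Delta are arbitrary illustrations.

```python
import math

def tail_term(l, m, delta):
    """Return u, the tail term evaluated at u, and its closed form."""
    u = math.sqrt(32 * (l + 1) * math.log(m) / m)
    tail = 128 * delta * m ** l / u * math.exp(-m * u ** 2 / 32)
    closed_form = 128 * delta / math.sqrt(32 * m * (l + 1) * math.log(m))
    return u, tail, closed_form

u, tail, closed_form = tail_term(l=5, m=200, delta=1.0)
# Both expressions agree up to floating point rounding.
print(math.isclose(tail, closed_form, rel_tol=1e-9))
```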
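We conclude thanks to the following steps:

(1) $\lim_{N \to +\infty} \sqrt{\frac{32 (l+1) \log m}{m}} = 0$ from the assumptions of Theorem 1;

(2) Lemma 5 in [3] shows that $L_d^* - L^* \to 0$ as $d \to +\infty$;

(3) Let $\epsilon > 0$ and take $d_0$ such that, for all $d \geq d_0$, $L_d^* - L^* \leq \epsilon$. To conclude, we finally have to prove that

$$\inf_{(C, K) \in [0, C_{d_0}] \times \mathcal{J}_{d_0}} L f_{(d_0, C, K)} \xrightarrow{N \to +\infty} L_{d_0}^*.$$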

This is a direct consequence of Theorem 2 in [40]. Let us show that the hypotheses of this theorem are fulfilled:

(a) Theorem 2 in [40] is valid for universal kernels that satisfy some requirements on their covering numbers. As we focus on $\inf_{(C, K) \in [0, C_{d_0}] \times \mathcal{J}_{d_0}} L f_{(d_0, C, K)}$, we can choose freely the kernel and the regularization parameter in $[0, C_{d_0}] \times \mathcal{J}_{d_0}$. Therefore, we choose a universal kernel $K_{d_0}$ with covering number of the form $O(\epsilon^{-\nu_{d_0}})$ for some $\nu_{d_0} > 0$ (this is possible according to our hypotheses).

(b) Theorem 2 in [40] asks for $X^{(d_0)}$ to take its values in a compact set of $\mathbb{R}^{d_0}$. Actually, $X$ is bounded in $\mathcal{X}$ so, by definition of $x \mapsto x^{(d_0)}$, $X^{(d_0)}$ takes its values in a bounded set of $\mathbb{R}^{d_0}$, which is included in a compact set of $\mathbb{R}^{d_0}$;

(c) Finally, Theorem 2 in [40] requests a particular behavior for $C_l$, the regularization parameter used for $l$ examples: $C_l$ is such that $l C_l \to +\infty$ and $C_l = O(l^{\beta - 1})$ for some $0 < \beta < \frac{1}{\nu_{d_0}}$. Let $\beta_{d_0}$ be any number in $]0, \frac{1}{\nu_{d_0}} \wedge 1[$ (where $a \wedge b$ denotes the infimum between $a$ and $b$). Then, let $C_l$ be $l^{\beta_{d_0} - 1}$. This defines a sequence of real numbers included in $]0, 1[$ which fulfills the requirements stated above. As $C_{d_0} > 1$, for all $l \geq 2$ we have $C_l \in [0, C_{d_0}]$; therefore such a choice of the regularization parameters is compatible with the hypothesis of our theorem.

This allows us to apply Theorem 2 in [40], which implies that $L f_{(d_0, (C_l), K_{d_0})}$ converges to $L_{d_0}^*$, and finally to obtain the conclusion.